Assignment instructions:

Assignment submission (YOUR NAME): _____Nicole Pepper_________________________________

Collaborators: Carmen Hoyt
r # ---- load libraries ---- library(tidyverse) library(here) library(janitor) library(estimatr) library(performance) library(jtools) library(gt) library(gtsummary) library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ## library(interactions)

DATA SOURCE:

Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 11/17/2019.


Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpinteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is centris paribus or whether selection bias is likely (be specific!).

There are a lot of location-specific variables in marine ecosystems that can’t be fully controlled for, so I think that selection bias is likely to be present not only between the control and treatment sites, but also within the treatment and control groups themselves. From what I found on wildlife.ca, the two treatment sites have different fishing regulations; while Isla Vista is a “No-Take” zone, Naples allows recreational take of some pelagic fish and for the commercial take of kelp. I imagine that the difference in fishing behavior could likely influence the broader marine ecosystem, including for lobsters, which could introduce bias within the treatment group. Additionally, the sites, though all located in the greater Santa Barbara region, are dispersed broadly across the coast and likely have different baselines in variables such as proximity to pollution, habitat disturbance, among other environmental/human factors that may introduce challenges in comparing the sites. Additionally, because there are no hard boundaries between the datasets, there is a risk that effects from the treatment could spillover and have an impact on the control sites. With this in mind, I am not confident that the control sites provide a strong counterfactual for our treatment sites.


Step 2: Read & wrangle data

a. Read in the raw data. Name the data.frame (df) rawdata

b. Use the function clean_names() from the janitor package

# ---- read in project data ----
rawdata <- read.csv(here::here("data","spiny_abundance_sb_18.csv"), 
                    na.strings = "-99999") |> # set to NA
    clean_names() # clean column names

c. Create a new df named tidydata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpinteria", "Mohawk", "Isla Vista",  "Naples"
# ---- clean & prep data ----

# create labels for full site name and set to ordered factor
tidydata <- rawdata |>
    mutate(reef = factor( site, 
                          levels = c("AQUE", "CARP", "MOHK", "IVEE",  "NAPL"),
                          labels = c("Arroyo Quemado", "Carpinteria", "Mohawk", "Isla Vista",  "Naples")
    ))

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

# ---- summarise and prepare lobster data for regression analysis ----

spiny_counts <- tidydata |>
    
    # group lobster data by site, reef, year, & transect
    group_by(site, year, transect) |>
    
    # count lobsters observed at each site-year-transect observation & mean size
    summarize(
        counts = sum(count, na.rm = TRUE),
        mean_size = mean(size_mm, na.rm = TRUE)
    ) |>
    ungroup() |>
    
    # create variables to distinguish MPA vs non-MPA sites
    mutate(
        mpa = case_when(
            site %in% c( "IVEE",  "NAPL") ~ "MPA",
            TRUE ~ "non_MPA"
            ),
        treat = if_else(mpa == "non_MPA", 0,1)
    )

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

# ---- explore data ----

head(spiny_counts)
## # A tibble: 6 × 7
##   site   year transect counts mean_size mpa     treat
##   <chr> <int>    <int>  <int>     <dbl> <chr>   <dbl>
## 1 AQUE   2012        1      5      64.2 non_MPA     0
## 2 AQUE   2012        2      9      66   non_MPA     0
## 3 AQUE   2012        3      0     NaN   non_MPA     0
## 4 AQUE   2012        4      9      74.1 non_MPA     0
## 5 AQUE   2012        5     11      76.9 non_MPA     0
## 6 AQUE   2012        6      0     NaN   non_MPA     0
head(tidydata)
##   year month       date site transect replicate size_mm count num_ao area
## 1 2012     8 2012-08-20 IVEE        1         A      NA     0      0  300
## 2 2012     8 2012-08-20 IVEE        1         B      NA     0      0  300
## 3 2012     8 2012-08-20 IVEE        1         C      NA     0      0  300
## 4 2012     8 2012-08-20 IVEE        1         D      NA     0      0  300
## 5 2012     8 2012-08-20 IVEE        2         A      NA     0      0  300
## 6 2012     8 2012-08-20 IVEE        2         B      NA     0      0  300
##         reef
## 1 Isla Vista
## 2 Isla Vista
## 3 Isla Vista
## 4 Isla Vista
## 5 Isla Vista
## 6 Isla Vista
dim(spiny_counts)
## [1] 252   7
dim(tidydata)
## [1] 4362   11

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year
#|

# ---- Grouped by reef site ----

# Create a violin plot of lobster counts by reef site
spiny_counts |> 
ggplot(aes(x = counts, y = site)) +
  geom_violin(fill = "cornflowerblue") +
    # Add box plot to display quartiles
    geom_boxplot(alpha = 0) +
    # Add point for mean value
    stat_summary(aes(x = counts, y = site),
                 fun = "mean",
                 shape = 3) + 
    labs(title = "Lobster Counts at Study Sites",
         subtitle = "+ indicates the mean count",
         x = "Lobster Count",
         y = "Site Name")

# ---- Grouped by year ----

# Create a bar chart of lobster counts by year
spiny_counts |>
ggplot(aes(x = year, y = counts, fill = mpa)) + 
    geom_col(position = "dodge") +
    
    # Add point for median
    stat_summary(aes(group = mpa),
                 fun = "median",
                 shape = 4,
                 position = position_dodge(width = 0.95)) +
    labs(title = "Lobster Counts by Year by MPA Status",
         subtitle = "X indicates the median count",
         x = "Year",
         y = "Count",
         fill = "MPA Status") 

# ---- Grouped by mpa status ----

# Create a jitter plot of counts by reef, grouped by mpa status
spiny_counts |>
ggplot(aes(x = counts, y = mpa, color = mpa)) +
  geom_jitter(alpha = 0.5) + 
    
    #add boxplot to display quartiles
    geom_boxplot(alpha = 0, color = "black", size = .65) +
    
    # Add point for median value
    stat_summary(aes(x = counts, y = mpa),
                 fun = "mean",
                 shape = 3,
                 color = "black") +
    
    labs(title = "Lobster Counts by Site & MPA Status",
         subtitle = "+ indicates the mean count",
         x = "Year",
         y = "Count",
         color = "MPA Status")

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# ---- Create a plot of mean lobster size ----

spiny_counts |>
    ggplot(aes(x = mean_size, fill = mpa)) +
    geom_density(alpha = 0.8) + 
    
    # Add vertical line for median count for MPA group
     geom_vline(data = spiny_counts |> filter(mpa == "MPA"),
                aes(xintercept = median(mean_size, na.rm = TRUE)), 
             color = "#B34640", 
             linetype = "dashed", 
             size = 1) +
    
    # Add vertical line for non-MPA group
    geom_vline(data = spiny_counts |> filter(mpa != "MPA"), aes(xintercept = median(mean_size, na.rm = TRUE)), 
             color = "#235959", 
             linetype = "dashed", 
             size = 1) +
    
    labs(title = "Distribution of Lobster Size by MPA Status",
         x = "Mean Lobster Size (mm)",
         y = "Density",
         fill = "MPA Status") 

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# Compare means of the outcome by treatment group

spiny_counts |> 
    ungroup() |>
    dplyr::select(treat, counts) |>
    tbl_summary(
        by = treat,
        statistic = list(all_continuous() ~ "{mean} ({sd})"))|>
    modify_header(label ~ "Variable")|>
    modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment**")
Variable
Treatment
0
N = 133
1
1
N = 119
1
counts 23 (39) 28 (44)
1 Mean (SD)

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

# Define simple OLS model of treatment impact on lobster counts
m1_ols <- lm(
    counts ~ treat,
    data = spiny_counts
)

# Print the model output
summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

The simple OLS model predicted that there is a 5% increase in lobster abundance at the treatment sites compared to the baseline. c. Check the model assumptions using the check_model function from the performance package

# Check model
check_model(m1_ols,  check = "qq" )

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

In OLS there is an assumption of normality of residuals. In the case of the the qq plot, ideally we would like to see the points fall along the line, since we are seeing a curve it means that the residuals are not normally distributed, violating the assumption of normality.

# Check normality
check_model(m1_ols, check = "normality")

This is another way to check the normality of the residuals. Ideally, we would see the distribution of the residuals (shaded in blue) fall within the bounds of the normal curve. This plot shows a skewed distribution of residuals that doesn’t follow the normal curve, it reinforces that the residuals are not normally distributed. Since there are more negative residuals, this could mean that the model is underestimating the values.

# Check homogeneity
check_model(m1_ols, check = "homogeneity")

This plot shows the homogeneity of variance. Ideally, the reference line would be flat, which would mean that there is little variance of the residuals across the fitted values. Since we are seeing a curve it means that the variance is not constant, violating the assumption of homogeneity of variance.

# Check pp check
check_model(m1_ols, check = "pp_check")

The posterior predictive check shows how well the model-predicted data fits the observed data. In this case, the observed data is significantly higher than the model-predicted data, which means that the model is underestimating the predictions. This could be for a number of reasons but indicates that we should maybe choose a different model that better fits the data and/or we may need to transform the data.


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

# ---- Fit a Poisson regression model with glm ----

m2_pois <- glm(
    counts ~ treat,
    data = spiny_counts,
    family = poisson(link = "log")
)

# Summarize model
summ(m2_pois, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE
# Transform the coefficient for IRR to get percent change
exp(0.21) - 1
## [1] 0.2336781

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

The poisson model predicted that there is a 23% increase in lobster abundance at the treatment sites compared to the control.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model. For a poisson model, there is an assumption that the mean and the variance are supposed to be equal. Dispersion describes the spread of data around the mean; and overdispersion is used to describe when the variance is greater than the mean. If there is overdispersion then the poisson model will underestimate the variability.

d. Compare results with previous model, explain change in the significance of the treatment effect

The poisson model predicted that there is a 32% increase in lobster abundance at the treatment sites, while the simple OLS model only estimated a 5% increase. This makes sense because the OLS model had a lot of negative residuals, which indicates that it was underestimating the predictions.

e. Check the model assumptions. Explain results.

# Check the model assumptions
check_model(m2_pois)

Looking at the QQ plot, it indicates that the model violates the assumption of normality, since the residuals are distributed in an “S” curve along the quantiles, rather than following along the line. The PP check shows that in some areas the model is underestimating and others it is significantly overestimating. This model fits the assumption of homogeneity.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

# Check for overdispersion
check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001

There was overdispersion detected in this model which means that the variance is greater than the mean; this indicates that there could be other factors (an omitted variable) influencing count that around included in the model.

# Check for zero inflation
check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

There was no observed zeros in the response variable, which means that there are other reasons, such as environmental factors, contributing to overdispersion.

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

# ---- Fit a negative binomial model with glm.nb ----

m3_nb <- glm.nb(
    counts ~ treat,
    data = spiny_counts
)

# Summarize model
summ(m3_nb, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE
# transform the coefficient
exp(0.21) - 1
## [1] 0.2336781

h. In 1-2 sentences explain rationale for fitting this GLM model. Binomial models are good to use when data indicates overdispersion because it handles variation better than models like OLS or poisson, like in the case of our lobster count dataset, because the nb model does not have the assumption that the mean = variance.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

The negative binomial model predicted that there is a 32% increase in lobster abundance at the treatment sites which is the same as what the poisson model predicted, while the simple OLS model only estimated a 5.9% increase.

# Check for overdispersion
check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088

There is no overdispersion detected in this model, because it accounts for the variance better.

# Check for zero inflation
check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12

There were no observed zeros in this model, which means we dont have to worry about that issue with this model.

# Check posterior predictions
check_predictions(m3_nb)

The pp check shows a much better fit for this model. The model predicted intervals match closely with the observed data points, which means that the model-predicted data fits the observed data well.

# Check model assumptions
check_model(m3_nb)

The checks all match the ideal distributions well, which means that the predicted data and observed values fit the assumptions of the model well.


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

# Create a table comparing model results
export_summs(m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

For all three models, there was a consistent and statistically significant positive treatment affect. Both the negative binomial model and the poisson model predicted that there is a 32% increase in lobster abundance at the treatment sites, while the simple OLS model only estimated a 5% increase. The poisson and negative binomial models performed similarly, despite the presence of overdispersion, which indicates that the positive treatment effect is robust.


Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following OLS model using lm()

# ---- Try again using 'year' as a factor ----
ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- glm.nb(
    counts ~ 
        treat +
        year +
        treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

# Check model summary
summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

This model is predicting how the effect of treatment on lobster counts varies by year. For most years, treatment had a positive effect on lobster abundance however. However, the “main effect” is negative.

d. Explain why the main effect for treatment is negative? *Does this result make sense?

This means that for the baseline year, 2012, treatment groups had lower lobster counts than the control group. This means that, on average, the treatment groups started with fewer lobsters which could introduce challenges in comparing the treated and non-treated areas.

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

# ---- Plot mean predictions by year and treatment site with interact_plot ----
interact_plot(m5_fixedeffs,
              pred = year,
              modx = treat,
              outcome.scale = "response")

f. Re-evaluate your responses (c) and (b) above. It means that the mpa and non-mpa groups started at different baselines.

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

# Group by year and mpa
plot_counts <- spiny_counts |> 
    group_by(year, mpa) |>
    
    # Calculate mean
    summarize(
        mean_count = mean(counts, na.rm = TRUE)
    ) |>
    
    ungroup()

# Create a line plot of mean count by year
plot_counts |> ggplot(aes(x = year, y = mean_count, color = mpa)) +
    geom_line()


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

The spillover effect is likely to be present in this research setting, especially since there are no hard boundary lines around the perimeter of the the study areas. There are environmental factors, some that could be influenced by the treatment, that could influence competition/ecology of the area influencing lobsters to migrate between the treated and control sites, especially for treatment sites that are located next to a control site.

  1. Explain why spillover is an issue for the identification of causal effects

Spillover is an issue because it violates the assumption of no interference between the control and treatment group for causal inference.

  1. How does spillover relate to impact in this research setting?

It is important to consider the spillover effect of our treatment areas when designing our study, so that we can choose control sites/methods that properly account for it.

  1. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption SUTVA requires no-interference, which is described above, and no hidden variations in the treatment.From what I found on wildlife.ca, the two treatment sites have different fishing regulations; while Isla Vista is a “No-Take” zone, Naples allows recreational take of some pelagic fish and for the commercial take of kelp, which I think would violate the requirement for no variations in treatment (since we didn’t account for it in our models). So I would argue that both of these assumptions are not reasonable as is.

    2. Excludability assumption Since the sites are all located in the greater Santa Barbara region, dispersed broadly across the coast and likely have different baselines in variables such as proximity to pollution, habitat disturbance, among other environmental/human factors that may introduce challenges in comparing the sites. I don’t think that excludability is a reasonable assumption as is. ————————————————————————

EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (lobster_sbchannel_24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)